{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 概率论和信息论笔记\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 常用概率分布\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bernoulli分布\n",
    "\n",
    "Bernoulli分布是单个二值随机变量的分布。它由单个参数$\\phi\\in[0,1]$控制，$\\phi$给出了随机变量等于1的概率。它有如下一些性质：\n",
    "\n",
    "\\begin{align}\n",
    "P(\\text{x}=1)&=\\phi \\\\\n",
    "P(\\text{x}=0)&=1-\\phi \\\\\n",
    "P(\\text{x}=x)&=\\phi^{x}(1-\\phi)^{1-\\phi} \\\\\n",
    "\\mathbb{E}_{\\text{x}}[\\text{x}]&=\\phi \\\\\n",
    "\\text{Var}_{\\text{x}}(\\text{x})&=\\phi(1-\\phi)\n",
    "\\end{align}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multinoulli分布\n",
    "\n",
    "Multinoulli分布或者**范畴分布**是指**在具有k个不同状态的单个离散型随机变量上的分布，其中k是一个有限值**。\n",
    "\n",
    "Multinoulli分布通常用来表示对象分类的分布。我们通常不需要计算Multinoulli分布的期望和方差。\n",
    "\n",
    "Bernoulli分布和Multinoulli分布走狗用来描述它们领域内的任何分布，因为它们领域很简单：**它们可以对那些能够将所有状态进行枚举的离散型随机变量进行建模**。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 高斯分布或者正态分布\n",
    "\n",
    "实数上最常用的分布就是**正太分布(normal distribution)**，也称为**高斯分布Gaussian distribution**。\n",
    "\n",
    "公式如下：\n",
    "\n",
    "$$\\mathcal{N}(x;\\mu,\\sigma^2)=\\sqrt\\frac{1}{2\\pi\\sigma^2}\\text{exp}(-\\frac{1}{2\\sigma^2}(x-\\mu)^2)$$\n",
    "\n",
    "\n",
    "高斯分布如下图所示：\n",
    "![Gaussian distribution](images/gaussian_distribution.png)\n",
    "\n",
    "我们用$\\beta^{-1}$替代$\\sigma^2$，表示**精度(precision)**，那么有：\n",
    "\n",
    "$$\\mathcal{N}(x;\\mu,\\beta^{-1})=\\sqrt\\frac{\\beta}{2\\pi}\\text{exp}(-\\frac{1}{2}\\beta(x-\\mu)^2)$$\n",
    "\n",
    "当我们由于缺乏关于某个实数上分布的先验知识而不知道该选择怎样的形式时，正态分布是默认的比较好的选择，其中有两个原因。\n",
    "* 我们想要建模的很多分布的真实情况是比较接近正态分布的，**中心极限定理（central limit theorem）**说明很多独立随机变量的和近似服从正态分布\n",
    "* 在具有相同方差的所有可能的概率分布中， 正态分布在实数上具有最大的不确定性，因此，我们可以认为正态分布是对模型加入的先验知识量最少的分布\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 指数分布和Laplace分布\n",
    "\n",
    "在深度学习中，我们经常会需要一个在 `x = 0` 点处取得**边界点 (sharp point)** 的分布。为了实现这一目的，我们可以使用**指数分布（exponential distribution）**：\n",
    "\n",
    "$$p(x;\\lambda)=\\lambda\\mathbb{1}_{x>=0} exp(-\\lambda x)$$\n",
    "\n",
    "其中，**指示函数(indicator function)**$\\mathbb{1}_{x>=0}$使得$x$取负数的时候值为零。\n",
    "\n",
    "\n",
    "一个联系紧密的概率分布是**Laplace 分布（Laplace distribution）**，它允许我们在任意一点 µ 处设置概率质量的峰值:\n",
    "\n",
    "$$Laplace(x;\\mu,\\gamma)=\\frac{1}{2\\gamma}exp(-\\frac{\\vert{x-\\mu}\\vert}{\\gamma})$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 常用函数的有用性质\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### sigmoid函数\n",
    "\n",
    "sigmoid函数常用来产生Bernoulli分布的参数$\\phi$，因为它的范围是(0,1)，处在$\\phi$的有效值范围。它的公式如下：\n",
    "\n",
    "$$\\sigma(x)=\\frac{1}{1+e^{-x}}$$\n",
    "\n",
    "![sigmoid](images/sigmoid.jpeg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### softplus函数\n",
    "\n",
    "softplus 函数可以用来产生正态分布的 β 和 σ 参数，因为它的范围是 (0, $\\infty$)。它的公式如下：\n",
    "\n",
    "$$\\zeta(x)=\\log(1+e^x)$$\n",
    "\n",
    "![softplus](images/softplus.png)\n",
    "\n",
    "它的名称来源是因为它是另一个函数的“软化”，这个函数是：\n",
    "\n",
    "$$x^+=max(0,x)$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以下是一些很有用的性质：\n",
    "\n",
    "\\begin{align}\n",
    "\\sigma(x)&=\\frac{1}{1+e^{-x}}=\\frac{e^x}{e^x+e^0} \\\\\n",
    "\\frac{d}{d_x}\\sigma(x)&=\\sigma(x)(1-\\sigma(x)) \\\\\n",
    "1-\\sigma(x)&=\\sigma(-x) \\\\\n",
    "\\log(\\sigma(x))&=-\\zeta(-x) \\\\\n",
    "\\frac{d}{d_x}\\zeta(x)&=\\sigma(x) \\\\\n",
    "\\forall{x} \\in (0,1),\\sigma(x)^{-1}&=\\log(\\frac{x}{1-x}) \\\\\n",
    "\\forall{x}>0,\\zeta(x)^{-1}&=\\log(exp(x)-1) \\\\\n",
    "\\zeta(x)&=\\int_{-\\infty}^{+\\infty}\\sigma(y)d_y \\\\\n",
    "\\zeta(x)-\\zeta(-x)&=x\n",
    "\\end{align}\n",
    "\n",
    "函数$\\sigma(x)^{-1}$在统计学中称为**分对数(logits)**，但是这个函数在机器学习里面很少用到。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 贝叶斯法则\n",
    "\n",
    "我们经常会需要在已知$P(y\\vert x)$时计算$P(x\\vert y)$。幸运的是，如果还知道$P(x)$，我们可以用**贝叶斯规则（Bayes’ rule）**来实现这一目的：\n",
    "\n",
    "$$P(x\\vert y)=\\frac{P(x)P(y\\vert x)}{P(y)}$$\n",
    "\n",
    "其中,\n",
    "\n",
    "$$P(y)=\\sum_xP(y\\vert x)P(x)$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 信息论\n",
    "\n",
    "定义一个事件$\\text{x}=x$的**自信息量(self-information)**：\n",
    "\n",
    "$$I(x)=-\\log P(x)$$\n",
    "\n",
    "其中$\\log$以e为底，$I(x)$的单位是**奈特(nats)**，$\\log$以2为底，$I(x)$的单位是**比特(bit)或者香农(shannons)**。默认使用e为底。\n",
    "\n",
    "自信息只处理单个的输出。我们可以用**香农熵（Shannon entropy）**来对整个概率分布中的不确定性总量进行量化：\n",
    "\n",
    "$$H(x)=\\mathbb{E}_{x\\sim P}[I(x)]=-\\mathbb{E}_{x\\sim P}[\\log P(x)]$$\n",
    "\n",
    "也记做$H(P)$。\n",
    "\n",
    "换言之，**一个分布的香农熵是指遵循这个分布的事件所产生的期望信息总量**。\n",
    "\n",
    "如果我们对于同一个随机变量$x$有两个单独的概率分布$P(x)$ 和$Q(x)$，我们可以使用**KL 散度（Kullback-Leibler (KL) divergence）**来衡量这两个分布的差异：\n",
    "\n",
    "$$D_{KL}(P\\parallel Q)=\\mathbb{E}_{x\\sim P}[\\log\\frac{P(x)}{Q(x)}]=\\mathbb{E}_{x\\sim P}[\\log P(x)-\\log Q(x)]$$\n",
    "\n",
    "一个和 KL 散度密切联系的量是**交叉熵（cross-entropy）**$H(P;Q) = H(P) +D_{KL}(P\\parallel Q)$，它和KL散度很像但是缺少左边一项：\n",
    "\n",
    "$$H(P,Q)=-\\mathbb{E}_{x\\sim P}\\log Q(x)$$"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
